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Abstract 

We consider a stochastic process for the generation of species which combines a 
Yule process with a simple model for hybridization between pairs of co-existent 
species. We assume that the origin of the process, when there was one species, 
occurred at an unknown time in the past, and we condition the process on 
producing n species via the Yule process and a single hybridization event. We 
prove results about the distribution of the time of the hybridization event. In 
particular we calculate a formula for all moments, and show that under various 
conditions, the distribution tends to an exponential with rate twice that of the 
birth rate for the Yule process. 
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1. Introduction 

Hybridization has an important role in the evolution of new species [21 E] ■ 
In phylogenetic analysis, there is an increasing interest in dealing with this issue 
[7J [BJ [T2] . The usual phylogenetic tree is replaced by a phylogenetic network [5] , 
and in a Bayesian approach, a prior for the network is needed [BJ. Very little 
is known about suitable prior distributions for the topology and node times for 
such networks. This paper represents an attempt to understand the situation 
better, and provides some justification for using an exponential distribution as 
a prior for the hybridization time. 

The particular biological motivation for this study originates from a theo- 
retical question on the evolution of polyploidy in plants. Polyploids can arise 
from within a single species (autoployploids) or via hybridization between two 
species (allopolyploids) in which the genomes of the two parental species are 
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Figure 1: Main time characteristics of the of the conditional Yule tree for n = 4 species with 
one hybridization: T is the time to origin, Ti, . . . , T4 are inter-speciation times, and T4 is the 
time to hybridization. 

both present in the hybrid. For example, suppose it is known that a tetraploid 
species of interest resulted from a hybridization between a pair of diploid species 
which are ancestral to a clade of n extant species. The following question arises: 
what can we say about the time of the hybridization event prior to a phyloge- 
netic analysis of the genetic data? 

The same question can be applied to homoploid hybridization, in which 
there is a hybridization but no change in ploidy. However we will refer to the 
allopolyploid case above, since the species produced by the Yule process and 
the hybrids can be conveniently called diploids and tetraploids. 

We assume a Yule model with the speciation rate A conditioned on n extant 
species and model the hybridization events by a Poisson process with intensity 
f3 giving the number of hybridizations per pair of coexisting diploid species per 
unit of time calibrated by A. This means that if there are k coexisting diploid 
species during a time period t, then the number of hybridizations Nk(t) during 
this period has a Poisson distribution 

P(N k (t)=j) = ^ e -^>,j = 0,1,2,... (1) 

with expectation 

E(N k (t)) = p(*V (2) 

Counting time backwards, let T k stand for the time between two consecutive 
speciation events during which the Yule tree had k branches, k — 2, . . . , n, 
see Fig. [I] In the conditioned Yule model setting (a random phylogeny for n 
extant species under the assumption of an improper uniform prior for the time of 
origin [2]) the times (T2, . . . ,T„) are independent and exponentially distributed 
random variables with parameters (2A, . . . , nX) respectively. Replacing t by T k 
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in formula ^ and writing N k = N k (T k ) gives 



E(N k ) = p(f\E(T k )= 7 (k-l), 

where the compound parameter 7 = A- can be understood as a relative hy- 
bridization rate. Thus averaging over possible species trees results in the mean 
total number of hybridizations N = N 2 + . . . + N n being 

E(N)=-y(^\. (3) 

The main finding of this paper is that the distribution of the time r„ to 
a single hybridization event can be approximated by an exponential distribu- 
tion with parameter 2A. This obtained by showing, see Corollary [4j that r-th 
moment of 2Ar„ converges to r! which is the r-th moment of an exponential 
distribution with parameter 1. Our simulations show that even for moderate 
values of n and reasonable values of 7 the exponential approximation for the 
time to hybridization seem to be satisfactory. 



2. The single hybridization condition 

Given that there was exactly one hybridization event, N = 1, we denote by 
T n the time to hybridization counted backwards from the time of observation. 
If N = or N > 2, we put r„ = 00. In this section we show among other things 
that the single hybridization condition has probability 

n — 1 1 

P(r„<») = G„n T -, (4) 
A A 1 + 17 

2 — 1 

where G n = J2k=l i+1-y - Observe that if r n < 00, then for some K n G {2, . . . , n} 
hybridization occured during the period when there were K n ancestral species. 

Lemma 1. For any 2 < k < n < 00 

(A; -1)7 



P(n n = k\r n < 00) = G n x : 



1 + (A-1)7" 

PROOF of Lemma [T] Replacing t by T k in the right hand side of ^ yields 
P(N k =0\T 2 ,...,T n ) = e~^) T \ 
P(N k = l\T 2 ,...,T n ) = p(^JT k e-^) T », 

and since 

K = k} = {N n = 0, . . .,JVjh-i =0,N k = l,N k -! = 0, . . . , N 2 = 0}, 
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we obtain 



P( Kn = k\T 2 ,...,T n ) = (3( k )T k f[e-^) T % (5) 



and therefore 

pu - ft) - p ® TT Xi - (fc " 1)7 TT 1 (6) 

Summing over k — 2, ... , n we arrive at (BJ), then the assertion of Lemma [I] 
follows by dividing the last expression by (H. 

If we assume that n diploids and a single hybridization have been observed, 
then we can apply two basic methods of estimation for the plausible value of the 
key parameter 7. The method of moments estimate j n = 1/(2) ^ s immediately 
obtained from ([3| by substituting the observed value N = 1 for E(N). We can 
also treat the expression for P(r n < 00) in Q as a likelihood function for 7 

" 1 71-1 , 

K7 



, ■ 17 ^— ^ 1 + &7 

i=l ' fe=l ' 

and from it find a maximum likelihood estimate j n . It turns out that for large 
n the two estimates are close in value 

2 2 

7« > —, tt for n > 2, and 7„ < — — for n > 4. (7) 

n(n — 1) n(n — 3) 

To show Q we observe first that the equation L'(ff) = for j n takes the 
form 

A(j) = jB(j) 2 , (8) 

where 

By the Cauchy-Schwarz inequality we have 
*%— 1 fe ^ 2 

■ fe=i i=i 

Ti—l k 



fe=l i=l 



fc=l <=1 fc=l »=1 

which together with ([8]) yields 1 < j n n ( n ~ 1 '> . On the other hand, since for x > 

Tl\Tl — 1 ) 

5(x) > A(x) and (1 + na^x) > ^ 

it follows from dsb that 1 + n7„ > -y„ n ("~ 1 ) an d 1 > -y n n ("~ 3 ) _ 
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3. Exact formula for any moment of T n 
Lemma 2. For any r > 1 



I n—± , n— 1 n— 1 n — 1 

w.<») = G; I jEiiE E- E 



fe=i 



\—kii—i\ i r —i r —i 



where dj — (1 + j) 1 (1 + 7j) . 



Proof of Lemma [2] Under the Poisson model for the flow of hybridization 
events 



r n = X+ T v 
where X is a random variable uniformly distributed on [0,T Kn ]. Thus 



(9) 



E{r r n \n n = k)=Ei (x + £ T 3 



An 



j=k+i 

r! 

a k \ ■ ■ ■ a n \ 



X ak Yl 

i=k+l 



where the sum is over all vectors a = (ojfc, . . . a n ) of non- negative integers with 
sum r. Next we take such an a and calculate the expectation of 



--k}- 



i=k+l 



We have in view of ^ 



E(M k , a ) = e (^r- 1 j\*»dx^x n ^^Q^n^ 



k-1 

3=2 
k-1 



(i)^E( T ^- K ' )T ^ Tl 



i=fc+l 



1 



x j(k - 1)- > 



J=2 

Recalling ([6| we deduce 



+ (i-l)7 " ,v " ^(l + (fc-l) 7 ) 2 " ^ 1 + (i-l)7" 



n i 



E(M k<a ) 



(fc-l)7 
1 + (fc - 1)7 ' 



n— 1 1 n 



= P( Kn =k)\- r l[a i \d'*L 1 , 
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which implies 



E«K=k)=-j2ii^=y e ir 

a i— k ii,... 7 i r j — 1 

k—l<ii---<i r <n—l 

Now to finish the proof of Lemma [2] it remains to apply Lemma [TJ 
In particular, for r = 1 and r = 2 Lemma [2] gives 

71—1 , 71—1 

m n := E [t„|t„ < oo] = \~ 1 G~ 1 ]T — -L- £ d Jf (10) 

fc=i 7 j=fc 

and 

n— 1 , n— 1 n— 1 

E [r 2 |r„ < oo] = 2A-G- 1 E ^ E E 

71—1 j ri—1 9 n— 1 

implying 

n— 1 7 n— 1 2 n— 1 

Var[r„|r n <cx)]=G ? 7 1 ^ r ^{(A- 1 ^d J -m n ) + A- 2 £d 2 }. (11) 

Here we have used the following observation: in terms of Y n :— A -1 X)j=« -i 
we have m n = E [Y n ] and 

n— 1 , n— 1 

Var [r„|r„ < oo] = E [F n 2 ] - m 2 + A^G" 1 £ — ^- £ d) 

k=l ' j=k 

71—1 i 71—1 

= E[(y„-m„) 2 ] + A- 2 G- 1 E r Tl^E4 

fe=l ' j=k 

4. Convergence to an exponential distribution 

For 2 < k < n < oo and any natural number r define rj ly k,n and C7,n,r by 

P(n n < k\r n < oo) = ^ — + fy kn ), 
n(n — 1) 

E [(2AT n ) r |r„ < oo] = r!(l — C~i,n,r)- 

Theorem 3. For 2 < k < n < oo and r > 1 the following bounds are valid 

-fry < Tj ltk>n < nj, (12) 
< C 7 ,n,r < (1 + (r + l)n) 7 . (13) 
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Figure 2: The upper bound in (|13| for 7 = equals n ^ n _^ H — „_i ■ This upper bound 

is illustrated by plotting three functions of n for the first three moments r = 1, 2, 3. 



The discussion in the end of Sectionj^concerning ^ showed that it is important 
to consider the values of 7 close to ■ 111 Figure |2j we plotted the upper 

bounds in ( 13 1 with 7 = n ^-i) as functions of n for the first three moments 
r = 1,2,3. 

Corollary 4. Uniformly over all (k,n) such that 2 < k < n < 00 

2(k — 1) 

P(n n = k\r n < 00) -> —. -{, n 7 ^0. (14) 

n(n — 1) 

Moreover, as wy — > for any fixed natural number r 

E[(2AT„) r |r n <oo]^r! (15) 

uniformly over A € (0, 00). 

Corollary[4]is a straightforward consequence of Theorem [3] proved next. Note 
that in the case 7 = rt („ 2 _i) the condition — > is equivalent to n — > 00. 
Proof of Theorem [3j According to Lemma [I] 

P(k„ < fc|r n < 00) = G k /G n , 



and (12) follows from 



1 v-Wsij-Y ,16) 



1 + n 7 ' V2 y ~ 1 +7 V 2 
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To prove (13) observe first that 



n— In— 1 n— 1 / 1 ., 

y k y... y — - 

n— 1 i r * 2 / 1 1 \ * I 

S ■ ^ , " ' S U + ' ' ' i + vj ^ ' 



1 ii—k 

n— 1 i 



n— 1 13 



ir = l »2 = 1 



2 £ " ' ^-I V 1 + i 2 " ' 1 + i- J S 



Clearly, for any l<fc<ii<i2<---<i r <n— 1 we have 
7 fc 



(l + n 7 )''+ 1 + ■■■{1 + ir) 



7 fc 
+ 11 "' ir - (l + jfe7)H-i (l + il )...(l + ir )- 



Thus Lemma [2] yields 



G„(l + n 7 )-+ 



- <E[(2A<T|r„ < oo] < r! 



7(3) 



and applying ( 16 ) we get inequalities 



(1 + 7 )(1 + n 7 )' r + 



Y < E [(2Ar„) r |r n < oo] < 



resulting in ( 13 1. 



5. Simulation results and discussion 

We have checked and illustrated our analytical results using simulations. 
Our simulation algorithm is based on the following steps to obtain a single 
hybridization time: 

Step 1 For [k = 1 to n): T k 4— sample from the exponential distribution with 
rate k\. 

Step 2 For (k = 1 to n): r k <- T k /3k(k - l)/2. 
Step 3 R^El=2 r k- 

Step 4 h <— sample from the Poisson distribution with mean R. 
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Figure 3: Conditional mean and variance of r n as functions (JTOJI and of the number n of 
candidate species. Simulations with A = 1 and /3 = n (^—i) are compared to the analytical 
predictions. 



Step 5 If (h —— 1) then sample k G {2,3, ... ,n} with probability proportional 
to 77., and then the hybridization time uniformly in the fcth interval. 

In Figure [3] the mean and variance of t„ as functions (10) and (11) of the 
number n of candidate species are drawn against the values obtained from sim- 
ulations. Here A = 1 and 7 = § = n (n-i) w ith n ranging from 2 to 200. 

Figure [4] shows simulated conditional distributions of r n . We can see how 
the observed distribution profile approaches the exponential curve as n increases 
from 2 to 20. 

The Yule model for the unknown species tree is not very realistic but it is a 
very convenient tool for phylogenetic calculations, see for example [3]. There- 
fore, the presented here results should be viewed as just a starting point for the 
issues raised in this paper. More biologically relevant extensions of the model 
studied here should take into account the possibility of hybridization between a 
pair of ancestral species of which either one or both have no direct descendants 
at present. To include extinct species in the analysis one can use the so-called 
conditioned birth-death processes developed in [TJ H] and successfully used as 
species tree models for various purposes, see for example jTTj. An important 
additional parameter arising in this more general setting is the extinction rate 
[i for the ancestral species. 

An crucial biological feature missing in the classical birth-death processes 
modeling species trees is geographical structure. Obviously, the probability of 
hybridization is conditional on geographical proximity. This could presumably 
be taken into account by combining our model with some statistical biogeog- 
raphy model [10] . Another desirable feature missing in the current analysis 
is the decaying hybridization rate: the more divergent two species become, the 
less probable hybridization between them will be. Of course, one should not 
limit oneself only to one hybridization event allowing for multiple hybridizations. 
Furthermore, hybrids can speciate via ordinary speciation, and hybridizations 
between hybrids also occur, so these processes should be included in a general 
model. 
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Figure 4: Histograms for r n conditional on a single hybridization event for the number n of 
candidate species. Left to right: n = 2, 3, 5, 10, 20. Simulations with A = 1 and (3 = n , 4 _ 1 - . 
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